Abstract

A client-server architecture to simultaneously solve multiple learning 
tasks from distributed datasets is described. In such an architecture, each
client is associated with an individual learning task and the associated 
dataset of examples. The goal of the architecture is to perform information 
fusion from multiple datasets while preserving privacy of individual data. 
The role of the server is to collect data in real-time from the clients and 
codify the information in a common database. The information coded 
in this database can be used by all the clients to solve their individual 
learning task, so that each client can exploit the informative content of 
all the datasets without actually having access to private data of others. 
The proposed algorithmic framework, based on regularization theory and 
kernel methods, uses a suitable class of "mixed effect" kernels. The new 
method is illustrated through a simulated music recommendation system. 

1 Introduction 

The solution of learning tasks by joint analysis of multiple datasets is receiving 
increasing attention in different fields and under various perspectives. Indeed, 
the information provided by data for a specific task may serve as a domain- 
specific inductive bias for the others. Combining datasets to solve multiple 
learning tasks is an approach known in the machine learning literature as multi- 
task learning or learning to learn [551 IHl IMl M 1131 HI 132] • In this context, 
the analysis of the inductive transfer process and the investigation of general 
methodologies for the simultaneous learning of multiple tasks are important top- 
ics of research. Many theoretical and experimental results support the intuition 
that, when relationships exist between the tasks, simultaneous learning performs 
better than separate (single-task) learning pT 1 [76 l[77l[75l fT7 l [6lfT6 l[7^[54l l4]. 
Theoretical results include the extension to the multi-task setting of generaliza- 
tion bounds and the notion of VC-dimension [101 Ull 113] and a methodology for 
learning multiple tasks exploiting unlabeled data (the so-called semi-supervised 
setting) [g. 

The importance of combining datasets is especially evident in biomedicine. In
pharmacological experiments, few training examples are typically available for a
specific subject due to technological and ethical constraints [20], [35]. This makes
it hard to formulate and quantify models from experimental data. To obviate this
problem, the so-called population method has been studied and applied with suc- 
cess since the seventies in pharmacology [531 HH [ZSj • Population methods are 
based on the knowledge that subjects, albeit different, belong to a population 
of similar individuals, so that data collected in one subject may be informative 
with respect to the others [72], [50]. Such population approaches belong to the
family of so-called mixed-effect statistical methods. In these methods, clini- 
cal measurements from different subjects are combined to simultaneously learn 
individual features of physiological responses to drug administration [64]. Population
methods have also been applied with success in other biomedical contexts
such as medical imaging and bioinformatics [28], [15]. Classical approaches postulate
finite-dimensional nonlinear dynamical systems whose unknown parameters
can be determined by means of optimization algorithms [TTl 15^ [T] . Other 
strategies include Bayesian estimation with stochastic simulation [TH [HI 
and nonparametric population methods [271 li^ liSl ITH H51 [5T] . 

Information fusion from different but related datasets is widespread also in 
econometrics and marketing analysis, where the goal is to learn user prefer- 
ences by analyzing both user-specific information and information from related 
users, see e.g. [BHl [3 [H [31] ■ The so-called conjoint analysis aims to determine 
the features of a product that most influence customers' decisions. On the
web, collaborative approaches to estimate user preferences have become stan- 
dard methodologies in many commercial systems and social networks, under the 
name of collaborative filtering or recommender systems, see e.g. [58]. Pioneering
collaborative filtering systems include Tapestry [30], GroupLens [ST] [3H], ReferralWeb
[36], and PHOAKS [57]. More recently, the collaborative filtering problem
has been attacked with machine learning methodologies such as Bayesian net- 
works [H], MCMC algorithms [22], mixture models [34], dependency networks 
[33], maximum margin matrix factorization [65] . 

Coming back to the machine learning literature, in the single-task context 
much attention has been given in the last years to non-parametric techniques 
such as kernel methods [BD] and Gaussian processes [S5]. These approaches are 
powerful and theoretically sound, having their mathematical foundations in reg- 
ularization theory for inverse problems, statistical learning theory and Bayesian 
estimation [3 [TU] [S31 [Z31 [HI Hi]. The flexibility of kernel engineering allows
for the estimation of functions defined on generic sets from arbitrary sources of 
data. These methodologies have been recently extended to the multi-task setting.
In [26], a general framework to solve multi-task learning problems using
kernel methods and regularization has been proposed, relying on the theory of
reproducing kernel Hilbert spaces (RKHS) of vector-valued functions [44].

In many applications (e-commerce, social network data processing, recom- 
mender systems), real-time processing of examples is required. On-line multi- 
task learning schemes find their natural application in data mining problems 
involving very large datasets, and are therefore required to scale well with the 
number of tasks and examples. In [52], an on-line task-wise algorithm to solve
multi-task regression problems has been proposed. The learning problem is for- 
mulated in the context of on-line Bayesian estimation, see e.g. [JHldS], within 
which Gaussian processes with suitable covariance functions are used to characterize
a non-parametric mixed-effect model. One of the key features of the
algorithm is the capability to exploit shared inputs between the tasks in order
to reduce computational complexity. However, the algorithm in [52] has a centralized
structure in which tasks are sequentially analyzed, and is able to
address neither architectural issues regarding the flux of information nor privacy
protection.

In this paper, multi-task learning from distributed datasets is addressed 
using a client-server architecture. In our scheme, clients are in a one-to-one 
correspondence with tasks and their individual database of examples. The role 
of the server is to collect examples from different clients in order to summarize 
their informative content. When a new example associated with any task be- 
comes available, the server executes an on-line update algorithm. While in [52] 
different tasks are sequentially analyzed, the architecture presented in this pa- 
per can process examples coming in any order from different learning tasks. The 
summarized information is stored in a disclosed database whose content is avail- 
able for download enabling each client to compute its own estimate exploiting 
the informative content of all the other datasets. Particular attention is paid 
to confidentiality issues, especially valuable in commercial and recommender 
systems, see e.g. [55], [19]. First, we require that each specific client cannot access
other clients' data. In addition, individual datasets cannot be reconstructed
from the disclosed database. Two kinds of clients are considered: active and
passive ones. An active client sends its data to the server, thus contributing to 
the collaborative estimate. A passive client only downloads information from 
the disclosed database without sending its data. A regularization problem with 
a parametric bias term is considered in which a mixed-effect kernel is used to
exploit relationships between the tasks. Albeit specific, the mixed-effect non- 
parametric model is quite flexible, and its usefulness has been demonstrated in 
several works [i5l|Tf[ HOj 15^ . 

The paper is organized as follows. Multi-task learning with regularized kernel
methods is presented in section 2, in which a class of mixed-effect kernels is also
introduced. In section 3, an efficient centralized off-line algorithm for multi-task
learning is described that solves the regularization problem of section 2.
In section 4, a rather general client-server architecture is described, which is
able to efficiently solve online multi-task learning from distributed datasets.
The server-side algorithm is derived and discussed in subsection 4.1, while the
client-side algorithm for both active and passive clients is derived in subsection
4.2. In section 5, a simulated music recommendation system is employed to
test performances of our algorithm. Conclusions (section 6) end the paper. The
Appendix contains technical lemmas and proofs.

Notational preliminaries 

• $X$ denotes a generic set with cardinality $|X|$.

• A vector is an element $\mathbf{a} \in X^n$ (an object with one index). Vectors are
denoted by lowercase bold characters. Vector components are denoted by
the corresponding non-bold letter with a subscript (e.g. $a_i$ denotes the
$i$-th component of $\mathbf{a}$).

• A matrix is an element $\mathbf{A} \in X^{n \times m}$ (an object with two indices). Matrices
are denoted by uppercase bold characters. Matrix entries are denoted
by the corresponding non-bold letter with two subscripts (e.g. $A_{ij}$ denotes
the entry of place $(i, j)$ of $\mathbf{A}$).

• Vectors $\mathbf{y} \in \mathbb{R}^n$ are associated with column matrices, unless otherwise
specified.

• For all $n \in \mathbb{N}$, let $[n] := \{1, 2, \ldots, n\}$.

• Let $\mathbf{I}$ denote the identity matrix of suitable dimension.

• Let $\mathbf{e}_i \in \mathbb{R}^n$ denote the $i$-th element of the canonical basis of $\mathbb{R}^n$ (all zeros
with 1 in position $i$):

$$\mathbf{e}_i := \begin{pmatrix} 0 & \cdots & 1 & \cdots & 0 \end{pmatrix}^T.$$

• An $(n, p)$ index vector is an object $\mathbf{k} \in [n]^p$.

• Given a vector $\mathbf{a} \in X^n$ and an $(n, p)$ index vector $\mathbf{k}$, let

$$\mathbf{a}(\mathbf{k}) := \begin{pmatrix} a_{k_1} & \cdots & a_{k_p} \end{pmatrix} \in X^p.$$

• Given a matrix $\mathbf{A} \in X^{n \times m}$ and two index vectors $\mathbf{k}^1$ and $\mathbf{k}^2$, that are $(n, p_1)$
and $(m, p_2)$, respectively, let

$$\mathbf{A}(\mathbf{k}^1, \mathbf{k}^2) := \begin{pmatrix} A_{k^1_1 k^2_1} & \cdots & A_{k^1_1 k^2_{p_2}} \\ \vdots & \ddots & \vdots \\ A_{k^1_{p_1} k^2_1} & \cdots & A_{k^1_{p_1} k^2_{p_2}} \end{pmatrix} \in X^{p_1 \times p_2}.$$

• Finally, let

$$\mathbf{A}(:, \mathbf{k}^2) := \mathbf{A}([n], \mathbf{k}^2), \qquad \mathbf{A}(\mathbf{k}^1, :) := \mathbf{A}(\mathbf{k}^1, [m]).$$

Notice that vectors, as defined in this paper, are not necessarily elements of
a vector space. The definition of "vector" adopted in this paper is similar to
that used in standard object-oriented programming languages such as C++.



2 Problem formulation 

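Let $m \in \mathbb{N}$ denote the total number of tasks. For the task $j$, a vector of $\ell_j$
input-output pairs $S^j \in (X \times \mathbb{R})^{\ell_j}$ is available:

$$S^j := \begin{pmatrix} (x_{1j}, y_{1j}) & \cdots & (x_{\ell_j j}, y_{\ell_j j}) \end{pmatrix},$$

sampled from a distribution $P_j$ on $X \times \mathbb{R}$. The aim of a multi-task regression
problem is learning $m$ functions $f_j : X \to \mathbb{R}$, such that expected errors with
respect to some loss function $L$

$$\int_{X \times \mathbb{R}} L(y, f_j(x)) \, dP_j$$

are small.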

Task-labeling is a simple technique to reduce multi-task learning problems to
single-task ones. Task-labels are integers $t_i$ that identify a specific task, say $t_i \in [m]$.
The overall dataset can be viewed as a set of triples $S \in ([m] \times X \times \mathbb{R})^{\ell}$,
where $\ell := \sum_{j=1}^{m} \ell_j$ is the overall number of examples:

$$S := \begin{pmatrix} (t_1, x_1, y_1) & \cdots & (t_\ell, x_\ell, y_\ell) \end{pmatrix}.$$

Thus, we can learn a single scalar-valued function defined over an input space
enlarged with the task-labels $f : X \times [m] \to \mathbb{R}$. The correspondence between
the dataset $S^j$ and $S$ is recovered through an $(\ell, \ell_j)$ index vector $\mathbf{k}^j$ such that

$$t_{k^j_i} = j, \qquad i \in [\ell_j].$$

Let $\mathcal{H}$ denote an RKHS of functions defined on the enlarged input space
$X \times [m]$ with kernel $K$, and $\mathcal{B}$ denote a $d$-dimensional bias-space. Solving the
multi-task learning problem by regularized kernel methods amounts to finding
$\hat{f} \in \mathcal{H} + \mathcal{B}$, such that

$$\hat{f} = \arg\min_{f \in \mathcal{H} + \mathcal{B}} \left( \sum_{i=1}^{\ell} V_i(y_i, f(x_i, t_i)) + \frac{\lambda}{2} \| P_{\mathcal{H}} f \|_{\mathcal{H}}^2 \right), \qquad (1)$$

where $V_i : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ are suitable loss functions, $\lambda > 0$ is the regularization
parameter and $P_{\mathcal{H}}$ is the projection operator into $\mathcal{H}$. In this paper, the focus is
on the mixed effect kernels, with the following structure:

$$K(x_1, t_1, x_2, t_2) = \alpha \overline{K}(x_1, x_2) + (1 - \alpha) \sum_{j=1}^{m} K_T^j(t_1, t_2) \widetilde{K}^j(x_1, x_2), \qquad (2)$$

where $0 \le \alpha \le 1$.

Kernels $\overline{K}$ and $\widetilde{K}^j$ are defined on $X \times X$ and can possibly be all distinct. On
the other hand, $K_T^j$ are "selector" kernels defined on $[m] \times [m]$ as

$$K_T^j(t_1, t_2) = \begin{cases} 1, & t_1 = t_2 = j; \\ 0, & \text{otherwise.} \end{cases}$$

Kernels $K_T^j$ are not strictly positive. Assume that $\mathcal{B}$ is spanned by functions
$\{\alpha \psi_i\}_1^d$. Of course, this is the same as using $\{\psi_i\}_1^d$ when $\alpha \neq 0$. However,
weighting the basis functions by $\alpha$ is convenient to recover the separate approach
($\alpha = 0$) by continuity. Usually, the dimension $d$ is relatively low. A common
choice might be $d = 1$ with $\psi_1 = \alpha$, that is $\mathcal{B}$ is simply the space of constant
functions, useful to make the learning method translation-invariant.
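
The structure of the mixed-effect kernel (2) can be illustrated with a minimal sketch. The two base kernels below (a Gaussian RBF for the common part and a linear kernel for the individual part) and the function names are assumptions made only for this example; the same individual kernel is used for every task for brevity.

```python
import math

def rbf(x1, x2):
    """Example common kernel: Gaussian RBF on real vectors (an assumption)."""
    return math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x1, x2)))

def linear(x1, x2):
    """Example individual kernel: plain inner product (an assumption)."""
    return sum(a * b for a, b in zip(x1, x2))

def mixed_effect_kernel(x1, t1, x2, t2, alpha, k_bar=rbf, k_tilde=linear):
    """Evaluate the mixed-effect kernel for two (input, task-label) pairs.

    The selector kernel equals 1 only when both task labels coincide, so the
    sum over tasks reduces to a single term when t1 == t2 and to zero otherwise.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    shared = alpha * k_bar(x1, x2)
    individual = (1.0 - alpha) * k_tilde(x1, x2) if t1 == t2 else 0.0
    return shared + individual

x, z = (1.0, 0.0), (0.0, 1.0)
pooled = mixed_effect_kernel(x, 1, x, 2, alpha=1.0)    # tasks fully shared
separate = mixed_effect_kernel(x, 1, z, 2, alpha=0.0)  # different tasks decoupled
```

With `alpha = 1` the task labels are ignored (pooled approach), while with `alpha = 0` examples from different tasks have zero kernel value (separate approach).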

Under rather general hypotheses for the loss function $V$, the representer
theorem, see e.g. [37], [59], gives the following expression for the optimum $\hat{f}$:

$$\hat{f}(x, t) = \alpha \left( \sum_{i=1}^{\ell} a_i \overline{K}(x_i, x) + \sum_{i=1}^{d} b_i \psi_i(x) \right) + (1 - \alpha) \sum_{i=1}^{\ell} \sum_{j=1}^{m} a_i K_T^j(t_i, t) \widetilde{K}^j(x_i, x).$$

The estimate $\hat{f}_j$ is defined to be the function obtained by plugging the corresponding
task-label $t = j$ in the previous expression. As a consequence of the
structure of $K$, the expression of $\hat{f}_j$ decouples into two parts:

$$\hat{f}_j(x) := \hat{f}(x, j) = \bar{f}(x) + \tilde{f}_j(x), \qquad (3)$$

where

$$\bar{f}(x) = \alpha \left( \sum_{i=1}^{\ell} a_i \overline{K}(x_i, x) + \sum_{i=1}^{d} b_i \psi_i(x) \right), \qquad \tilde{f}_j(x) = (1 - \alpha) \sum_{i=1}^{\ell_j} a_{k^j_i} \widetilde{K}^j(x_{k^j_i}, x).$$

Function $\bar{f}$ is independent of $j$ and can be regarded as a sort of average task,
whereas $\tilde{f}_j$ is a non-parametric individual shift. The value $\alpha$ is related to the
"shrinking" of the individual estimates toward the average task. When $\alpha = 1$,
the same function is learned for all the tasks, as if all examples referred to a
unique task (pooled approach). On the other hand, when $\alpha = 0$, all the tasks
are learned independently (separate approach), as if tasks were not related at
all.

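Throughout this paper, the problem is specialized to the case of (weighted)
squared loss functions

$$V_i(y, f(x, t)) = \frac{1}{2 w_i} (y - f(x, t))^2,$$

where $\mathbf{w} \in \mathbb{R}_+^{\ell}$ denotes a weight vector. For squared loss functions, coefficient
vectors $\mathbf{a}$ and $\mathbf{b}$ can be obtained by solving the linear system [73]

$$\begin{pmatrix} \mathbf{K} + \lambda \mathbf{W} & \boldsymbol{\Psi} \\ \boldsymbol{\Psi}^T & \mathbf{0} \end{pmatrix} \begin{pmatrix} \mathbf{a} \\ \alpha \mathbf{b} \end{pmatrix} = \begin{pmatrix} \mathbf{y} \\ \mathbf{0} \end{pmatrix}, \qquad (4)$$

where

$$\mathbf{W} = \mathrm{diag}(\mathbf{w}), \qquad \mathbf{K} = \alpha \overline{\mathbf{K}} + (1 - \alpha) \sum_{j=1}^{m} \mathbf{I}(:, \mathbf{k}^j) \widetilde{\mathbf{K}}^j(\mathbf{k}^j, \mathbf{k}^j) \mathbf{I}(\mathbf{k}^j, :),$$

$$\overline{K}_{ij} = \overline{K}(x_i, x_j), \qquad \widetilde{K}^k_{ij} = \widetilde{K}^k(x_i, x_j), \qquad \Psi_{ij} = \psi_j(x_i).$$

For $\alpha = 0$, vector $\mathbf{b}$ is not well determined. The linear system can also be solved
via back-fitting on the residual generated by the parametric bias estimate:

$$\alpha \left[ \boldsymbol{\Psi}^T (\mathbf{K} + \lambda \mathbf{W})^{-1} \boldsymbol{\Psi} \right] \mathbf{b} = \boldsymbol{\Psi}^T (\mathbf{K} + \lambda \mathbf{W})^{-1} \mathbf{y}, \qquad (5)$$

$$(\mathbf{K} + \lambda \mathbf{W}) \mathbf{a} = \mathbf{y} - \alpha \boldsymbol{\Psi} \mathbf{b}. \qquad (6)$$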
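
The equivalence between the block system (4) and the back-fitting equations (5)-(6) can be checked numerically. The sketch below is only illustrative: it assumes the block system has the saddle-point form with unknowns (a, alpha b), and it uses random stand-ins for the kernel matrix, the weights and the bias matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
ell, d, alpha, lam = 8, 2, 0.5, 0.1

# Hypothetical data: a random symmetric positive semidefinite K stands in
# for the kernel matrix, W = diag(w) holds positive weights.
G = rng.standard_normal((ell, ell))
K = G @ G.T
W = np.diag(rng.uniform(0.5, 2.0, size=ell))
Psi = rng.standard_normal((ell, d))          # Psi[i, j] = psi_j(x_i)
y = rng.standard_normal(ell)
A = K + lam * W

# Direct solve of the block system; the unknown is the stacked vector (a, alpha * b).
big = np.block([[A, Psi], [Psi.T, np.zeros((d, d))]])
sol = np.linalg.solve(big, np.concatenate([y, np.zeros(d)]))
a_direct, b_direct = sol[:ell], sol[ell:] / alpha

# Back-fitting: solve for b first, then for a on the residual y - alpha * Psi b.
Ainv_Psi = np.linalg.solve(A, Psi)
Ainv_y = np.linalg.solve(A, y)
b = np.linalg.solve(alpha * (Psi.T @ Ainv_Psi), Psi.T @ Ainv_y)
a = np.linalg.solve(A, y - alpha * Psi @ b)
```

Both routes return the same coefficients up to round-off, and the second block row shows up as the orthogonality condition `Psi.T @ a = 0`.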
3 Complexity reduction



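In many applications of multi-task learning, some or all of the input data $x_i$ are
shared between the tasks so that the number of different basis functions appearing
in the expansion (3) may be considerably less than $\ell$. As explained below,
this feature can be exploited to derive efficient incremental online algorithms
for multi-task learning. Introduce the vector of unique inputs

$$\breve{\mathbf{x}} \in X^n, \quad \text{such that } \breve{x}_i \neq \breve{x}_j, \ \forall i \neq j,$$

where $n \le \ell$ denotes the number of unique inputs. For each task $j$, a new $(n, \ell_j)$
index vector $\mathbf{h}^j$ can be defined such that

$$x_{ij} = \breve{x}_{h^j_i}, \qquad i \in [\ell_j].$$

Let also

$$\mathbf{a}^j := \mathbf{a}(\mathbf{k}^j), \qquad \mathbf{y}^j := \mathbf{y}(\mathbf{k}^j), \qquad \mathbf{w}^j := \mathbf{w}(\mathbf{k}^j).$$

The information contained in the index vectors $\mathbf{h}^j$ is equivalently represented
through a binary matrix $\mathbf{P} \in \{0, 1\}^{\ell \times n}$, such that

$$P_{ij} = \begin{cases} 1, & x_i = \breve{x}_j; \\ 0, & \text{otherwise.} \end{cases}$$

We have the following decompositions:

$$\breve{\mathbf{a}} := \mathbf{P}^T \mathbf{a}, \qquad \overline{\mathbf{K}} = \mathbf{P} \breve{\mathbf{K}} \mathbf{P}^T, \qquad \breve{\mathbf{K}} := \mathbf{L} \mathbf{D} \mathbf{L}^T, \qquad \boldsymbol{\Psi} = \mathbf{P} \breve{\boldsymbol{\Psi}}, \qquad (7)$$

where $\mathbf{L} \in \mathbb{R}^{n \times r}$, $\mathbf{D} \in \mathbb{R}^{r \times r}$ are suitable rank-$r$ factors, $\mathbf{D}$ is diagonal, and
$\mathbf{L} \mathbf{D} \mathbf{L}^T \in \mathbb{R}^{n \times n}$. $\breve{\mathbf{K}} \in \mathbb{R}^{n \times n}$ is a kernel matrix associated with the condensed
input set $\breve{\mathbf{x}}$: $\breve{K}_{ij} = \overline{K}(\breve{x}_i, \breve{x}_j)$, $\breve{\boldsymbol{\Psi}} \in \mathbb{R}^{n \times d}$, $\breve{\Psi}_{ij} = \psi_j(\breve{x}_i)$. If $\overline{K}$ is strictly positive,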
we can assume r = n and L can be taken as a lower triangular matrix, see e.g. 

m- 

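Solution (3) can be rewritten in a compact form:

$$\hat{f}_j(x) = \alpha \left( \sum_{i=1}^{n} \breve{a}_i \overline{K}(\breve{x}_i, x) + \sum_{i=1}^{d} b_i \psi_i(x) \right) + (1 - \alpha) \sum_{i=1}^{\ell_j} a^j_i \widetilde{K}^j(\breve{x}_{h^j_i}, x).$$

Introduce the following "compatibility condition" between kernel $\overline{K}$ and the
bias space $\mathcal{B}$.

Assumption 1 There exists $\mathbf{M} \in \mathbb{R}^{r \times d}$ such that

$$\mathbf{L} \mathbf{D} \mathbf{M} = \breve{\boldsymbol{\Psi}}.$$

Assumption 1 is automatically satisfied in the no-bias case or when $\overline{K}$ is strictly
positive.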

The next result shows that coefficients $\mathbf{a}$ and $\mathbf{b}$ can be obtained by solving
a system of linear equations involving only "small-sized" matrices so that complexity
depends on the number of unique inputs rather than the total number
of examples.
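
Algorithm 1 Centralized off-line algorithm.
1: $\mathbf{R} \leftarrow \mathbf{0}$
2: for $j = 1 : m$ do
3: $\mathbf{R}^j \leftarrow \left[ (1 - \alpha) \widetilde{\mathbf{K}}^j(\mathbf{k}^j, \mathbf{k}^j) + \lambda \mathbf{W}(\mathbf{k}^j, \mathbf{k}^j) \right]^{-1}$
4: $\mathbf{R} \leftarrow \mathbf{R} + \mathbf{I}([\ell], \mathbf{k}^j) \mathbf{R}^j \mathbf{I}(\mathbf{k}^j, [\ell])$
5: end for
6: if $\alpha \neq 0$ then
7: Compute factors $\mathbf{L}$, $\mathbf{D}$, $\mathbf{M}$
8: $\breve{\mathbf{y}} \leftarrow \mathbf{L}^T \mathbf{P}^T \mathbf{R} \mathbf{y}$
9: $\mathbf{H} \leftarrow \left( \mathbf{D}^{-1} + \alpha \mathbf{L}^T \mathbf{P}^T \mathbf{R} \mathbf{P} \mathbf{L} \right)^{-1}$
10: $\mathbf{b} \leftarrow$ Solution to $\left( \mathbf{M}^T (\mathbf{D} - \mathbf{H}) \mathbf{M} \right) \mathbf{b} = \mathbf{M}^T \mathbf{H} \breve{\mathbf{y}}$
11: $\mathbf{a} \leftarrow \mathbf{R} \left[ \mathbf{y} - \alpha \mathbf{P} \mathbf{L} \mathbf{H} (\breve{\mathbf{y}} + \mathbf{M} \mathbf{b}) \right]$
12: else
13: $\mathbf{a} = \mathbf{R} \mathbf{y}$
14: end if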


Theorem 1 Let Assumption 1 hold. Coefficient vectors $\mathbf{a}$ and $\mathbf{b}$ can be evaluated
through Algorithm 1. For $\alpha = 0$, $\mathbf{b}$ is undetermined.

Algorithm 1 is an off-line (centralized) procedure whose computational complexity
scales with $O(n^3 m + d^3)$. In the following section, a client-server on-line
version of Algorithm 1 will be derived that preserves this complexity bound.
Typically, this is much better than $O((\ell + d)^3)$, the worst-case complexity of
directly solving (4).
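
A compact numerical sketch of the centralized procedure is given below, together with a reference solver that assembles the full block system. Several choices are assumptions made only for illustration: Gaussian RBF kernels are used for both the common and the individual part, the factorization is taken as a Cholesky factor with D equal to the identity, the per-task sums are accumulated directly instead of forming the matrix P, and all function names are hypothetical.

```python
import numpy as np

def rbf(A, B):
    """Gaussian RBF kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2)

def algorithm1(Xu, h, ys, ws, Psi_u, alpha, lam, k_bar=rbf, k_tilde=rbf):
    """Condensed solver (alpha != 0): returns per-task coefficients a^j and b."""
    n = Xu.shape[0]
    L = np.linalg.cholesky(k_bar(Xu, Xu))    # condensed kernel matrix = L D L^T, D = I
    D = np.eye(n)
    M = np.linalg.solve(L @ D, Psi_u)        # compatibility condition L D M = Psi_u
    Rs, y_breve, F = [], np.zeros(n), np.zeros((n, n))
    for hj, yj, wj in zip(h, ys, ws):
        Kj = k_tilde(Xu[hj], Xu[hj])
        Rj = np.linalg.inv((1 - alpha) * Kj + lam * np.diag(wj))
        Lj = L[hj, :]                        # rows of P L that belong to task j
        y_breve += Lj.T @ Rj @ yj            # accumulates L^T P^T R y
        F += alpha * Lj.T @ Rj @ Lj          # accumulates alpha L^T P^T R P L
        Rs.append(Rj)
    H = np.linalg.inv(np.linalg.inv(D) + F)
    b = np.linalg.solve(M.T @ (D - H) @ M, M.T @ H @ y_breve)
    g = H @ (y_breve + M @ b)
    a = [Rj @ (yj - alpha * L[hj, :] @ g) for Rj, hj, yj in zip(Rs, h, ys)]
    return a, b

def direct_solve(Xu, h, ys, ws, Psi_u, alpha, lam, k_bar=rbf, k_tilde=rbf):
    """Reference: assemble the full system with unknowns (a, alpha * b)."""
    idx = np.concatenate(h)                  # examples ordered task by task
    ell, d = idx.size, Psi_u.shape[1]
    K = alpha * k_bar(Xu[idx], Xu[idx])
    start = 0
    for hj in h:                             # add the task-specific diagonal blocks
        stop = start + len(hj)
        K[start:stop, start:stop] += (1 - alpha) * k_tilde(Xu[hj], Xu[hj])
        start = stop
    A = K + lam * np.diag(np.concatenate(ws))
    Psi = Psi_u[idx]
    big = np.block([[A, Psi], [Psi.T, np.zeros((d, d))]])
    rhs = np.concatenate([np.concatenate(ys), np.zeros(d)])
    sol = np.linalg.solve(big, rhs)
    return sol[:ell], sol[ell:] / alpha

rng = np.random.default_rng(1)
Xu = rng.standard_normal((5, 2))                        # n = 5 unique inputs
h = [np.array([0, 1, 2]), np.array([1, 3]), np.array([0, 2, 3, 4])]
ys = [rng.standard_normal(len(hj)) for hj in h]
ws = [np.ones(len(hj)) for hj in h]
Psi_u = np.ones((5, 1))                                 # constant bias, d = 1
a_tasks, b = algorithm1(Xu, h, ys, ws, Psi_u, alpha=0.3, lam=0.1)
a_ref, b_ref = direct_solve(Xu, h, ys, ws, Psi_u, alpha=0.3, lam=0.1)
```

In this toy problem the condensed solver only inverts matrices whose size is bounded by the number of unique inputs, whereas the reference solver works with the full set of nine examples.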

4 A client-server online algorithm 

Now, we are ready to describe the structure of the client-server algorithm. It 
is assumed that each client is associated with a different task. The role of the 
server is twofold: 

1. Collecting triples $(x_i, y_i, w_i)$ (input-output-weight) from the clients and
updating on-line all matrices and coefficients needed to compute estimates
for all the tasks.

2. Publishing sufficient information so that any client (task) $j$ can independently
compute its estimate $\hat{f}_j$, possibly without sending data to the
server.

On the other hand, each client $j$ can perform two kinds of operations:

1. Sending triples $(x_i, y_i, w_i)$ to the server.

2. Receiving information from the server sufficient to compute its own estimate
$\hat{f}_j$.

It is required that each client can neither access other clients' data nor reconstruct
their individual estimates. We have the following scheme:

• Undisclosed Information: $\mathbf{h}^j$, $\mathbf{y}^j$, $\mathbf{w}^j$, $\mathbf{R}^j$, for $j \in [m]$.

• Disclosed Information: $\breve{\mathbf{x}}$, $\breve{\mathbf{y}}$, $\mathbf{H}$.
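
[Diagram: clients send triples $(x_i, y_i, w_i)$ to the server, which holds the disclosed quantities $\breve{\mathbf{x}}$, $\breve{\mathbf{y}}$, $\mathbf{H}$ and an undisclosed database with $\mathbf{h}^j$, $\mathbf{y}^j$, $\mathbf{w}^j$, $\mathbf{R}^j$ for $j = 1, \ldots, m$.]

Figure 1: The client-server scheme.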



4.1 Server side 

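In order to formulate the algorithm in compact form, it is useful to introduce
the functions "find", "ker" and "bias". Let

$$A(x) := \{ i : x_i = x \}.$$

For any $p, q \in \mathbb{N}$, $x \in X$, $\mathbf{x} \in X^p$, $\mathbf{y} \in X^q$, let

$$\mathrm{find} : X \times X^p \to [p + 1], \qquad \mathrm{find}(x, \mathbf{x}) = \begin{cases} p + 1, & A(x) = \emptyset, \\ \min A(x), & A(x) \neq \emptyset, \end{cases}$$

$$\mathrm{ker}(\cdot, \cdot\,; K) : X^p \times X^q \to \mathbb{R}^{p \times q}, \qquad \mathrm{ker}(\mathbf{x}, \mathbf{y}; K)_{ij} = K(x_i, y_j),$$

$$\mathrm{bias} : X^p \to \mathbb{R}^{p \times d}, \qquad \mathrm{bias}(\mathbf{x})_{ij} = \psi_j(x_i).$$

The complete computational scheme is reported in Algorithm 2. The initialization
is defined by resorting to empty matrices whose manipulation rules
can be found in [45]. In particular, $\mathbf{h}^j$, $\mathbf{y}^j$, $\mathbf{w}^j$, $\mathbf{R}^j$, $\mathbf{D}$, $\mathbf{L}$, $\mathbf{M}$, $\breve{\mathbf{x}}$, $\breve{\mathbf{y}}$, $\mathbf{H}$ are all
initialized to the empty matrix. In this respect, it is assumed that functions "ker"
and "bias" return an empty matrix as output, when applied to empty matrices.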
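
The three helper functions can be sketched in a few lines. This is only an illustration under simple assumptions: inputs are compared with `==`, matrices are represented as lists of rows, and empty inputs yield empty lists, mirroring the empty-matrix convention.

```python
def find(x, xs):
    """1-based position of the first occurrence of x in xs,
    or len(xs) + 1 when x does not occur."""
    for i, xi in enumerate(xs, start=1):
        if xi == x:
            return i
    return len(xs) + 1

def ker(xs, ys, kernel):
    """p-by-q matrix (list of rows) with entries kernel(xs[i], ys[j])."""
    return [[kernel(xi, yj) for yj in ys] for xi in xs]

def bias(xs, basis):
    """p-by-d matrix with entries basis[j](xs[i])."""
    return [[psi(xi) for psi in basis] for xi in xs]

position = find("b", ["a", "b", "b"])
missing = find("z", ["a", "b", "b"])
gram = ker([1.0, 2.0], [3.0], lambda u, v: u * v)
```

A return value of `len(xs) + 1` signals a new input, which is the condition tested by the server before enlarging its database.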

Algorithm 2 is mainly based on the use of matrix factorizations and matrix
manipulation lemmas in the Appendix. The rest of this subsection is an extensive
proof devoted to showing that Algorithm 2 correctly updates all the relevant
quantities when a new triple $(x_i, y_i, w_i)$ becomes available from task $j$. Three
cases are possible:

1. The input $x_i$ is already among the inputs of task $j$.

2. The input $x_i$ is not among the inputs of task $j$, but can be found in the
common database $\breve{\mathbf{x}}$.
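
Algorithm 2 Server: receive $(x_i, y_i, w_i)$ from client $j$ and update the database.
1: $s = \mathrm{find}(x_i, \breve{\mathbf{x}})$
2: if ($s = n + 1$) then
3: $n \leftarrow n + 1$,
4: $\breve{\mathbf{x}} \leftarrow \begin{pmatrix} \breve{\mathbf{x}} & x_i \end{pmatrix}$,
5: $\breve{\mathbf{y}} \leftarrow \begin{pmatrix} \breve{\mathbf{y}} \\ 0 \end{pmatrix}$,
6: $\mathbf{k} \leftarrow \mathrm{ker}(x_i, \breve{\mathbf{x}}; \overline{K})$,
7: $\boldsymbol{\psi} \leftarrow \mathrm{bias}(x_i)$,
8: $\mathbf{r} \leftarrow$ Solution to $\mathbf{L} \mathbf{D} \mathbf{r} = \mathbf{k}([n - 1])$,
9: $\beta \leftarrow k_n - \mathbf{r}^T \mathbf{D} \mathbf{r}$,
10: $\mathbf{M} \leftarrow \begin{pmatrix} \mathbf{M} \\ \beta^{-1} (\boldsymbol{\psi} - \mathbf{r}^T \mathbf{D} \mathbf{M}) \end{pmatrix}$,
11: $\mathbf{H} \leftarrow \begin{pmatrix} \mathbf{H} & \mathbf{0} \\ \mathbf{0}^T & \beta \end{pmatrix}$,
12: $\mathbf{D} \leftarrow \begin{pmatrix} \mathbf{D} & \mathbf{0} \\ \mathbf{0}^T & \beta \end{pmatrix}$,
13: $\mathbf{L} \leftarrow \begin{pmatrix} \mathbf{L} & \mathbf{0} \\ \mathbf{r}^T & 1 \end{pmatrix}$,
14: end if
15: $p = \mathrm{find}(x_i, \breve{\mathbf{x}}(\mathbf{h}^j))$
16: if ($p = \ell_j + 1$) then
17: $\ell_j \leftarrow \ell_j + 1$,
18: $\mathbf{h}^j \leftarrow \begin{pmatrix} \mathbf{h}^j & s \end{pmatrix}$,
19: $\mathbf{y}^j \leftarrow \begin{pmatrix} \mathbf{y}^j \\ y_i \end{pmatrix}$,
20: $\mathbf{w}^j \leftarrow \begin{pmatrix} \mathbf{w}^j \\ w_i \end{pmatrix}$,
21: $\mathbf{k} \leftarrow (1 - \alpha) \cdot \mathrm{ker}(x_i, \breve{\mathbf{x}}(\mathbf{h}^j); \widetilde{K}^j)$,
22: $\mathbf{u} \leftarrow \begin{pmatrix} \mathbf{R}^j \mathbf{k}([\ell_j - 1]) \\ -1 \end{pmatrix}$,
23: $\gamma \leftarrow 1 / (\lambda w_i - \mathbf{u}^T \mathbf{k})$,
24: $\mu \leftarrow \gamma \mathbf{u}^T \mathbf{y}^j$,
25: $\mathbf{R}^j \leftarrow \begin{pmatrix} \mathbf{R}^j & \mathbf{0} \\ \mathbf{0}^T & 0 \end{pmatrix}$,
26: else
27: $w^j_p \leftarrow w^j_p w_i / (w^j_p + w_i)$,
28: $y^j_p \leftarrow y^j_p + w^j_p (y_i - y^j_p) / w_i$,
29: $\gamma \leftarrow \left[ (w_i - w^j_p) / (\lambda (w^j_p)^2) - R^j_{pp} \right]^{-1}$,
30: $\mathbf{u} \leftarrow \mathbf{R}^j(:, p)$,
31: $\mu \leftarrow w^j_p (y_i - y^j_p) / (w_i - w^j_p) + \gamma \mathbf{u}^T \mathbf{y}^j$,
32: end if
33: $\mathbf{R}^j \leftarrow \mathbf{R}^j + \gamma \mathbf{u} \mathbf{u}^T$,
34: $\mathbf{v} \leftarrow \mathbf{L}^T(:, \mathbf{h}^j) \mathbf{u}$,
35: $\mathbf{z} \leftarrow \mathbf{H} \mathbf{v}$,
36: $\breve{\mathbf{y}} \leftarrow \breve{\mathbf{y}} + \mu \mathbf{v}$,
37: $\mathbf{H} \leftarrow \mathbf{H} - \dfrac{\mathbf{z} \mathbf{z}^T}{(\alpha \gamma)^{-1} + \mathbf{v}^T \mathbf{z}}$.
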
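3. The input $x_i$ is new.

4.1.1 Case 1: repetition within inputs of task j

The input $x_i$ has been found in $\breve{\mathbf{x}}(\mathbf{h}^j)$, so that it is also present in $\breve{\mathbf{x}}$. Thus, we
have

$$p \neq \ell_j + 1,$$

and the flow of Algorithm 2 can be equivalently reorganized as in Algorithm 3.

Algorithm 3 Server (Case 1).
1: $s = \mathrm{find}(x_i, \breve{\mathbf{x}})$
2: $p = \mathrm{find}(x_i, \breve{\mathbf{x}}(\mathbf{h}^j))$
3: $w^j_p \leftarrow w^j_p w_i / (w^j_p + w_i)$,
4: $y^j_p \leftarrow y^j_p + w^j_p (y_i - y^j_p) / w_i$,
5: $\gamma \leftarrow \left[ (w_i - w^j_p) / (\lambda (w^j_p)^2) - R^j_{pp} \right]^{-1}$,
6: $\mathbf{u} \leftarrow \mathbf{R}^j(:, p)$,
7: $\mathbf{R}^j \leftarrow \mathbf{R}^j + \gamma \mathbf{u} \mathbf{u}^T$,
8: $\mu \leftarrow w^j_p (y_i - y^j_p) / (w_i - w^j_p) + \gamma \mathbf{u}^T \mathbf{y}^j$,
9: $\mathbf{v} \leftarrow \mathbf{L}^T(:, \mathbf{h}^j) \mathbf{u}$,
10: $\breve{\mathbf{y}} \leftarrow \breve{\mathbf{y}} + \mu \mathbf{v}$,
11: $\mathbf{z} \leftarrow \mathbf{H} \mathbf{v}$,
12: $\mathbf{H} \leftarrow \mathbf{H} - \dfrac{\mathbf{z} \mathbf{z}^T}{(\alpha \gamma)^{-1} + \mathbf{v}^T \mathbf{z}}$.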



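Let $r$ denote the number of triples of the type $(x, y_i, w_i)$ belonging to task
$j$. These data can be replaced by a single triple $(x, \bar{y}, \bar{w})$ without changing the
output of the algorithm. Let

$$\bar{w} := \left( \sum_{i=1}^{r} \frac{1}{w_i} \right)^{-1}, \qquad \bar{y} := \bar{w} \sum_{i=1}^{r} \frac{y_i}{w_i}.$$

The part of the empirical risk regarding these data can be rewritten as

$$\sum_{i=1}^{r} \frac{(y_i - f_j(x))^2}{2 w_i} = \frac{1}{2} \left( f_j(x)^2 \sum_{i=1}^{r} \frac{1}{w_i} - 2 f_j(x) \sum_{i=1}^{r} \frac{y_i}{w_i} + \sum_{i=1}^{r} \frac{y_i^2}{w_i} \right) = \frac{f_j(x)^2 - 2 f_j(x) \bar{y}}{2 \bar{w}} + \sum_{i=1}^{r} \frac{y_i^2}{2 w_i} = \frac{(\bar{y} - f_j(x))^2}{2 \bar{w}} + A,$$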

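where $A$ is a constant independent of $f$. To recursively update $\bar{w}$ and $\bar{y}$ when a
repetition is detected, notice that

$$\bar{w} \leftarrow \left( \sum_{i=1}^{r+1} \frac{1}{w_i} \right)^{-1} = \left( \frac{1}{\bar{w}} + \frac{1}{w_{r+1}} \right)^{-1} = \frac{\bar{w} w_{r+1}}{\bar{w} + w_{r+1}},$$

$$\bar{y} \leftarrow \left( \frac{1}{\bar{w}} + \frac{1}{w_{r+1}} \right)^{-1} \left( \frac{\bar{y}}{\bar{w}} + \frac{y_{r+1}}{w_{r+1}} \right) = \bar{y} + \frac{\bar{w}}{w_{r+1}} (y_{r+1} - \bar{y}),$$

where the $\bar{w}$ appearing in the rightmost expression is the already updated weight.
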
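
The recursion for merging repeated measurements can be checked against the batch definition with a short sketch. Function names and the numerical values are hypothetical; weights play the role of the $w_i$ in the weighted squared loss.

```python
def merge(w_bar, y_bar, w_new, y_new):
    """Fold one more measurement (y_new, w_new) at the same input into the
    summary (y_bar, w_bar); the weight is updated first and then reused."""
    w = w_bar * w_new / (w_bar + w_new)
    y = y_bar + w * (y_new - y_bar) / w_new
    return w, y

def batch(ws, ys):
    """Batch summary: inverse of the summed inverse weights and weighted mean."""
    w = 1.0 / sum(1.0 / wi for wi in ws)
    y = w * sum(yi / wi for wi, yi in zip(ws, ys))
    return w, y

ws, ys = [1.0, 2.0, 0.5], [1.0, 3.0, -2.0]
w_rec, y_rec = ws[0], ys[0]
for wi, yi in zip(ws[1:], ys[1:]):
    w_rec, y_rec = merge(w_rec, y_rec, wi, yi)
w_all, y_all = batch(ws, ys)
```

Processing the three measurements one at a time gives the same summary pair as the batch formulas, so a repeated input never needs to be stored more than once per task.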



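By applying these formulas to the $p$-th data of task $j$, lines 3, 4 of Algorithm 3
are obtained. To check that $\mathbf{R}^j$ is correctly updated by lines 5, 6, 7 of Algorithm 3,
just observe that, taking into account the definition of $\mathbf{R}^j$ and applying Lemma 1,
we have:

$$\mathbf{R}^j \leftarrow \left( (\mathbf{R}^j)^{-1} - \frac{\lambda (w^j_p)^2}{w_i - w^j_p} \mathbf{e}_p \mathbf{e}_p^T \right)^{-1} = \mathbf{R}^j + \frac{\mathbf{R}^j \mathbf{e}_p \mathbf{e}_p^T \mathbf{R}^j}{\dfrac{w_i - w^j_p}{\lambda (w^j_p)^2} - \mathbf{e}_p^T \mathbf{R}^j \mathbf{e}_p} = \mathbf{R}^j + \gamma \mathbf{u} \mathbf{u}^T.$$

Consider now the update of $\breve{\mathbf{y}}$. Since $\mathbf{y}^j$ has already been updated, the previous
$\mathbf{y}^j$ is given by

$$\mathbf{y}^j - \mathbf{e}_p \Delta y^j_p,$$

where the variation $\Delta y^j_p$ of $y^j_p$ can be expressed as

$$\Delta y^j_p = \frac{w^j_p (y_i - y^j_p)}{w_i - w^j_p}.$$

Recalling the definition of $\breve{\mathbf{y}}$ in Algorithm 1 and line 7 of Algorithm 3 we have

$$\breve{\mathbf{y}} \leftarrow \mathbf{L}^T \left( \sum_{k \neq j} \mathbf{I}(:, \mathbf{h}^k) \mathbf{R}^k \mathbf{y}^k + \mathbf{I}(:, \mathbf{h}^j) \left( \mathbf{R}^j + \gamma \mathbf{u} \mathbf{u}^T \right) \mathbf{y}^j \right).$$

By adding and subtracting $\Delta y^j_p \mathbf{e}_p$, using the definition of $\mu$ in line 8,

$$\left( \mathbf{R}^j + \gamma \mathbf{u} \mathbf{u}^T \right) \mathbf{y}^j = \mathbf{R}^j \left( \mathbf{y}^j - \Delta y^j_p \mathbf{e}_p \right) + \left( \Delta y^j_p + \gamma (\mathbf{u}^T \mathbf{y}^j) \right) \mathbf{u} = \mathbf{R}^j \left( \mathbf{y}^j - \Delta y^j_p \mathbf{e}_p \right) + \mu \mathbf{u}.$$

Hence,

$$\breve{\mathbf{y}} \leftarrow \breve{\mathbf{y}} + \mu \mathbf{L}^T(:, \mathbf{h}^j) \mathbf{u}.$$

By defining $\mathbf{v}$ as in line 9 of Algorithm 3 the update of line 10 is obtained.
Finally, we show that $\mathbf{H}$ is correctly updated. Let

$$\mathbf{F} := \alpha \mathbf{L}^T \mathbf{P}^T \mathbf{R} \mathbf{P} \mathbf{L}.$$

Then, from the definition of $\mathbf{H}$ it follows that

$$\mathbf{H} = \left( \mathbf{D}^{-1} + \mathbf{F} \right)^{-1}.$$

In view of lines 7, 9 of Algorithm 3,

$$\mathbf{F} \leftarrow \mathbf{F} + \alpha \gamma \mathbf{v} \mathbf{v}^T,$$

so that

$$\mathbf{H} \leftarrow \left( \mathbf{D}^{-1} + \mathbf{F} + \alpha \gamma \mathbf{v} \mathbf{v}^T \right)^{-1}.$$

By Lemma 1 lines 11, 12 are obtained.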

4.1.2 Case 2: repetition in x. 

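Since $x_i$ belongs to $\breve{\mathbf{x}}$ but not to $\breve{\mathbf{x}}(\mathbf{h}^j)$, we have

$$p = \ell_j + 1.$$

The flow of Algorithm 2 can be organized as in Algorithm 4.

Algorithm 4 Server (Case 2)
1: $s = \mathrm{find}(x_i, \breve{\mathbf{x}})$
2: $p = \mathrm{find}(x_i, \breve{\mathbf{x}}(\mathbf{h}^j))$
3: $\ell_j \leftarrow \ell_j + 1$
4: $\mathbf{h}^j \leftarrow \begin{pmatrix} \mathbf{h}^j & s \end{pmatrix}$
5: $\mathbf{y}^j \leftarrow \begin{pmatrix} \mathbf{y}^j \\ y_i \end{pmatrix}$
6: $\mathbf{w}^j \leftarrow \begin{pmatrix} \mathbf{w}^j \\ w_i \end{pmatrix}$
7: $\mathbf{k} \leftarrow (1 - \alpha) \cdot \mathrm{ker}(x_i, \breve{\mathbf{x}}(\mathbf{h}^j); \widetilde{K}^j)$
8: $\mathbf{u} \leftarrow \begin{pmatrix} \mathbf{R}^j \mathbf{k}([\ell_j - 1]) \\ -1 \end{pmatrix}$
9: $\gamma \leftarrow 1 / (\lambda w_i - \mathbf{u}^T \mathbf{k})$
10: $\mathbf{R}^j \leftarrow \begin{pmatrix} \mathbf{R}^j & \mathbf{0} \\ \mathbf{0}^T & 0 \end{pmatrix} + \gamma \mathbf{u} \mathbf{u}^T$
11: $\mathbf{v} \leftarrow \mathbf{L}^T(:, \mathbf{h}^j) \mathbf{u}$
12: $\mu \leftarrow \gamma \mathbf{u}^T \mathbf{y}^j$
13: $\breve{\mathbf{y}} \leftarrow \breve{\mathbf{y}} + \mu \mathbf{v}$,
14: $\mathbf{z} \leftarrow \mathbf{H} \mathbf{v}$,
15: $\mathbf{H} \leftarrow \mathbf{H} - \dfrac{\mathbf{z} \mathbf{z}^T}{(\alpha \gamma)^{-1} + \mathbf{v}^T \mathbf{z}}$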



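First, vectors $\mathbf{h}^j$, $\mathbf{y}^j$ and $\mathbf{w}^j$ must be properly enlarged as in lines 3-6. Recalling
the definition of $\mathbf{R}^j$, we have:

$$(\mathbf{R}^j)^{-1} \leftarrow \begin{pmatrix} (\mathbf{R}^j)^{-1} & \mathbf{k}([\ell_j - 1]) \\ \mathbf{k}([\ell_j - 1])^T & k_{\ell_j} + \lambda w_i \end{pmatrix}.$$

The update for $\mathbf{R}^j$ in lines 7-10 is obtained by applying Lemma 2 with $\mathbf{A} = (\mathbf{R}^j)^{-1}$.

Consider now the update of $\breve{\mathbf{y}}$. Recall that $\mathbf{h}^j$ and $\mathbf{y}^j$ have already been
updated. By the definition of $\breve{\mathbf{y}}$ and in view of line 10 of Algorithm 4 we have

$$\breve{\mathbf{y}} \leftarrow \mathbf{L}^T \left( \sum_{k \neq j} \mathbf{I}(:, \mathbf{h}^k) \mathbf{R}^k \mathbf{y}^k + \mathbf{I}(:, \mathbf{h}^j) \left[ \begin{pmatrix} \mathbf{R}^j & \mathbf{0} \\ \mathbf{0}^T & 0 \end{pmatrix} + \gamma \mathbf{u} \mathbf{u}^T \right] \mathbf{y}^j \right) = \breve{\mathbf{y}} + \gamma (\mathbf{u}^T \mathbf{y}^j) \mathbf{L}^T \mathbf{I}(:, \mathbf{h}^j) \mathbf{u}.$$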

The update in lines 11-13 immediately follows. Finally, the update in lines 14-15 
for H is obtained by applying Lemma [2] as in Case 1. 
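
The two linear-algebra steps used in this case can be verified numerically: enlarging an inverse by one row and column through a rank-one correction, and updating the inverse of a matrix perturbed by a rank-one term. The sketch below uses hypothetical helper names and random test matrices; it only illustrates the identities and is not the server implementation.

```python
import numpy as np

def enlarge_inverse(R, k, lam_w):
    """Given R = A^{-1} and k (cross terms followed by the new diagonal kernel
    value), return the inverse of [[A, k[:-1]], [k[:-1]^T, k[-1] + lam_w]]."""
    u = np.append(R @ k[:-1], -1.0)
    gamma = 1.0 / (lam_w - u @ k)            # inverse of the Schur complement
    R_big = np.zeros((len(k), len(k)))
    R_big[:-1, :-1] = R                      # old inverse padded with zeros
    return R_big + gamma * np.outer(u, u), u, gamma

def rank_one_update(H, v, alpha, gamma):
    """Inverse of (H^{-1} + alpha * gamma * v v^T) without a new inversion."""
    z = H @ v
    return H - np.outer(z, z) / (1.0 / (alpha * gamma) + v @ z)

rng = np.random.default_rng(3)
G = rng.standard_normal((4, 4))
B = G @ G.T + 4 * np.eye(4)                  # enlarged matrix to be inverted
lam_w = 0.2                                  # stands for lambda * w_i
k = np.append(B[:3, 3], B[3, 3] - lam_w)
R_new, u, gamma = enlarge_inverse(np.linalg.inv(B[:3, :3]), k, lam_w)

H_old = np.linalg.inv(B)                     # any symmetric positive definite matrix
v = rng.standard_normal(4)
alpha = 0.3
H_new = rank_one_update(H_old, v, alpha, gamma)
```

Both helpers cost a number of operations quadratic in the matrix size, which is the reason why each incoming example can be processed without refactoring the stored matrices.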



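4.1.3 Case 3: $x_i$ is a new input.

Algorithm 5 Server (Case 3)
1: $n \leftarrow n + 1$
2: $\breve{\mathbf{x}} \leftarrow \begin{pmatrix} \breve{\mathbf{x}} & x_i \end{pmatrix}$,
3: $\mathbf{k} \leftarrow \mathrm{ker}(x_i, \breve{\mathbf{x}}; \overline{K})$,
4: $\mathbf{r} \leftarrow$ Solution to $\mathbf{L} \mathbf{D} \mathbf{r} = \mathbf{k}([n - 1])$,
5: $\beta \leftarrow k_n - \mathbf{r}^T \mathbf{D} \mathbf{r}$,
6: $\boldsymbol{\psi} \leftarrow \mathrm{bias}(x_i)$,
7: $\mathbf{M} \leftarrow \begin{pmatrix} \mathbf{M} \\ \beta^{-1} (\boldsymbol{\psi} - \mathbf{r}^T \mathbf{D} \mathbf{M}) \end{pmatrix}$
8: $\mathbf{D} \leftarrow \begin{pmatrix} \mathbf{D} & \mathbf{0} \\ \mathbf{0}^T & \beta \end{pmatrix}$
9: $\mathbf{L} \leftarrow \begin{pmatrix} \mathbf{L} & \mathbf{0} \\ \mathbf{r}^T & 1 \end{pmatrix}$
10: $\breve{\mathbf{y}} \leftarrow \begin{pmatrix} \breve{\mathbf{y}} \\ 0 \end{pmatrix}$
11: $\mathbf{H} \leftarrow \begin{pmatrix} \mathbf{H} & \mathbf{0} \\ \mathbf{0}^T & \beta \end{pmatrix}$
12: Call Algorithm 4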



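Since $x_i$ is a new input, we have

$$s = n + 1, \qquad p = \ell_j + 1.$$

The flow of Algorithm 2 can be reorganized as in Algorithm 5. The final part
of Algorithm 5 coincides with Algorithm 4. However, the case of new input also
requires updating factors $\mathbf{D}$ and $\mathbf{L}$ and matrix $\mathbf{M}$. Assume that $\overline{K}$ is strictly
positive so that $\mathbf{D}$ is diagonal and $\mathbf{L}$ is lower triangular. If $\overline{K}$ is not strictly
positive, other kinds of decompositions can be used. In particular, for the linear
kernel $\overline{K}(x_1, x_2) = \langle x_1, x_2 \rangle$ over $\mathbb{R}^r$, $\mathbf{D}$ and $\mathbf{L}$ can be taken, respectively, equal
to the identity and $\breve{\mathbf{x}}$. Recalling that $\breve{\mathbf{K}} = \mathbf{L} \mathbf{D} \mathbf{L}^T$, we have

$$\breve{\mathbf{K}} \leftarrow \begin{pmatrix} \breve{\mathbf{K}} & \mathbf{k}([n - 1]) \\ \mathbf{k}([n - 1])^T & k_n \end{pmatrix} = \begin{pmatrix} \mathbf{L} \mathbf{D} \mathbf{L}^T & \mathbf{k}([n - 1]) \\ \mathbf{k}([n - 1])^T & k_n \end{pmatrix} = \begin{pmatrix} \mathbf{L} & \mathbf{0} \\ \mathbf{r}^T & 1 \end{pmatrix} \begin{pmatrix} \mathbf{D} & \mathbf{0} \\ \mathbf{0}^T & \beta \end{pmatrix} \begin{pmatrix} \mathbf{L}^T & \mathbf{r} \\ \mathbf{0}^T & 1 \end{pmatrix},$$

with $\mathbf{r}$ and $\beta$ as in lines 4-5.

Concerning $\mathbf{M}$, recall from Assumption 1 that

$$\mathbf{L} \mathbf{D} \mathbf{M} = \breve{\boldsymbol{\Psi}}.$$

The relation must remain true by substituting the updated quantities. Indeed,
after the update in lines 6-9, we have

$$\mathbf{L} \mathbf{D} \mathbf{M} \leftarrow \begin{pmatrix} \mathbf{L} & \mathbf{0} \\ \mathbf{r}^T & 1 \end{pmatrix} \begin{pmatrix} \mathbf{D} & \mathbf{0} \\ \mathbf{0}^T & \beta \end{pmatrix} \begin{pmatrix} \mathbf{M} \\ \beta^{-1} (\boldsymbol{\psi} - \mathbf{r}^T \mathbf{D} \mathbf{M}) \end{pmatrix} = \begin{pmatrix} \mathbf{L} \mathbf{D} \mathbf{M} \\ \mathbf{r}^T \mathbf{D} \mathbf{M} + \boldsymbol{\psi} - \mathbf{r}^T \mathbf{D} \mathbf{M} \end{pmatrix} = \begin{pmatrix} \breve{\boldsymbol{\Psi}} \\ \boldsymbol{\psi} \end{pmatrix}.$$

Finally, it is easy to see that updates for $\breve{\mathbf{y}}$ and $\mathbf{H}$ are similar to that of previous
Case 2, once the enlargements in lines 10-11 are made.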
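
The incremental extension of the factors when a new input arrives can be sketched as follows. The helper name, the Gaussian RBF kernel and the constant bias are assumptions made for the example; the factorization is started from a single input and then extended one input at a time.

```python
import numpy as np

def extend_factors(L, D, M, k, psi):
    """Append one input to the factors of the condensed kernel matrix.

    k holds the kernel values against the old inputs followed by the new
    diagonal value; psi is the row of basis-function values at the new input.
    """
    n = L.shape[0]
    r = np.linalg.solve(L @ D, k[:-1])
    beta = k[-1] - r @ D @ r
    L_new = np.block([[L, np.zeros((n, 1))], [r[None, :], np.ones((1, 1))]])
    D_new = np.block([[D, np.zeros((n, 1))],
                      [np.zeros((1, n)), np.array([[beta]])]])
    M_new = np.vstack([M, (psi - r @ D @ M) / beta])
    return L_new, D_new, M_new

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 2))
K = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))   # strictly positive kernel
Psi = np.ones((4, 1))                                       # constant bias, d = 1

# Start from the first input and add the remaining ones sequentially.
L, D, M = np.ones((1, 1)), K[:1, :1].copy(), Psi[:1] / K[0, 0]
for n in range(1, 4):
    L, D, M = extend_factors(L, D, M, K[: n + 1, n], Psi[n])
```

After the loop the factors reproduce the full kernel matrix and the compatibility condition between the factors and the bias matrix still holds.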



4.2 Client side 

To obtain coefficients $\mathbf{a}$ by Algorithm 1, access to undisclosed data $\mathbf{h}^j$, $\mathbf{y}^j$, $\mathbf{R}^j$ is
required. Nevertheless, as shown next, each client can compute its own estimate
$\hat{f}_j$ without having access to the undisclosed data. It is not even necessary to
know the overall number $m$ of tasks, nor their "individual kernels" $\widetilde{K}^j$: all the
required information is contained in the disclosed quantities $\breve{\mathbf{x}}$, $\breve{\mathbf{y}}$ and $\mathbf{H}$. From
the client point of view, knowledge of $\breve{\mathbf{x}}$ is equivalent to the knowledge of $\breve{\mathbf{K}}$ and
$\breve{\boldsymbol{\Psi}}$. In turn, also $\mathbf{L}$, $\mathbf{D}$ and $\mathbf{M}$ can be computed using the factorization (7) and
the definition of $\mathbf{M}$ in Assumption 1. As mentioned in the introduction, two
kinds of clients are considered.

• An active client $j$ sends its own data to the server. This kind of client
can request both the disclosed information and its individual coefficients
$\mathbf{a}^j$ (Algorithm 6).

• A passive client $j$ does not send its data. In this case, the server is not able
to compute $\mathbf{a}^j$. This kind of client can only request the disclosed information,
and must run a local version of the server to obtain $\mathbf{a}^j$ (Algorithm 7).

The following Theorem ensures that vector $\breve{\mathbf{a}}$ can be computed by knowing only
disclosed data.

Theorem 2 Given $\breve{\mathbf{x}}$, $\breve{\mathbf{y}}$ and $\mathbf{H}$, the condensed coefficients vector $\breve{\mathbf{a}}$ can be computed
by solving the linear system

$$\mathbf{D} \mathbf{L}^T \breve{\mathbf{a}} = \mathbf{H} (\breve{\mathbf{y}} + \mathbf{M} \mathbf{b}) - \mathbf{D} \mathbf{M} \mathbf{b}.$$

Once the disclosed data and vector $\breve{\mathbf{a}}$ have been obtained, each client still
needs the individual coefficients vector $\mathbf{a}^j$ in order to perform predictions for its
own task. While an active client can simply receive such vector from the server,
a passive client must compute it independently. Interestingly, it turns out that
$\mathbf{a}^j$ can be computed by knowing only disclosed data together with private data
of task $j$. Indeed, line 11 of Algorithm 1 decouples with respect to the different
tasks:

$$\mathbf{a}^j \leftarrow \mathbf{R}^j \left[ \mathbf{y}^j - \alpha \mathbf{L}(\mathbf{h}^j, :) (\mathbf{z} + \mathbf{H} \mathbf{M} \mathbf{b}) \right], \qquad \mathbf{z} := \mathbf{H} \breve{\mathbf{y}}.$$

This is the key feature that allows a passive client to perform predictions without
disclosing its private data and exploiting the information contained in all the
other datasets.



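Algorithm 6 (Active client $j$) Receive $\breve{\mathbf{x}}$, $\breve{\mathbf{y}}$, $\mathbf{H}$ and $\mathbf{a}^j$ and evaluate $\breve{\mathbf{a}}$, $\mathbf{b}$
1: for $i = 1 : n$ do
2: $\mathbf{k} \leftarrow \mathrm{ker}(\breve{x}_i, \breve{\mathbf{x}}([i]); \overline{K})$,
3: $\mathbf{r} \leftarrow$ Solution to $\mathbf{L} \mathbf{D} \mathbf{r} = \mathbf{k}([i - 1])$,
4: $\beta \leftarrow k_i - \mathbf{r}^T \mathbf{D} \mathbf{r}$,
5: $\mathbf{D} \leftarrow \begin{pmatrix} \mathbf{D} & \mathbf{0} \\ \mathbf{0}^T & \beta \end{pmatrix}$,
6: $\mathbf{L} \leftarrow \begin{pmatrix} \mathbf{L} & \mathbf{0} \\ \mathbf{r}^T & 1 \end{pmatrix}$,
7: $\boldsymbol{\psi} \leftarrow \mathrm{bias}(\breve{x}_i)$,
8: $\mathbf{M} \leftarrow \begin{pmatrix} \mathbf{M} \\ \beta^{-1} (\boldsymbol{\psi} - \mathbf{r}^T \mathbf{D} \mathbf{M}) \end{pmatrix}$,
9: end for
10: $\mathbf{z} \leftarrow \mathbf{H} \breve{\mathbf{y}}$
11: $\mathbf{b} \leftarrow$ Solution to $\left( \mathbf{M}^T (\mathbf{D} - \mathbf{H}) \mathbf{M} \right) \mathbf{b} = \mathbf{M}^T \mathbf{z}$,
12: $\breve{\mathbf{a}} \leftarrow$ Solution to $(\mathbf{D} \mathbf{L}^T) \breve{\mathbf{a}} = \mathbf{z} + (\mathbf{H} - \mathbf{D}) \mathbf{M} \mathbf{b}$.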


5 Illustrative example: music recommendation 

In this section, the proposed algorithm is applied to a simulated music rec- 
ommendation problem, in order to predict preferences of several virtual users 
with respect to a set of artists. Artist data were obtained from the May 2005
AudioScrobbler Database dump^1, which is the last dump released by AudioScrobbler/LastFM
under Creative Commons license. LastFM is an internet
radio that provides individualized broadcasts based on user preferences. The
database dump includes users' playcounts and artists' names so that it is possible
to rank artists according to global number of playcounts. After sorting artists
according to decreasing playcounts, 489 top ranking artists were selected. The

^1 http://www-etud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html
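
Algorithm 7 (Passive client $j$) Receive $\breve{\mathbf{x}}$, $\breve{\mathbf{y}}$ and $\mathbf{H}$ and evaluate $\breve{\mathbf{a}}$, $\mathbf{b}$ and $\mathbf{a}^j$
1: for $i = 1 : n$ do
2: $\mathbf{k} \leftarrow \mathrm{ker}(\breve{x}_i, \breve{\mathbf{x}}([i]); \overline{K})$,
3: $\mathbf{r} \leftarrow$ Solution to $\mathbf{L} \mathbf{D} \mathbf{r} = \mathbf{k}([i - 1])$,
4: $\beta \leftarrow k_i - \mathbf{r}^T \mathbf{D} \mathbf{r}$,
5: $\mathbf{D} \leftarrow \begin{pmatrix} \mathbf{D} & \mathbf{0} \\ \mathbf{0}^T & \beta \end{pmatrix}$,
6: $\mathbf{L} \leftarrow \begin{pmatrix} \mathbf{L} & \mathbf{0} \\ \mathbf{r}^T & 1 \end{pmatrix}$,
7: $\boldsymbol{\psi} \leftarrow \mathrm{bias}(\breve{x}_i)$,
8: $\mathbf{M} \leftarrow \begin{pmatrix} \mathbf{M} \\ \beta^{-1} (\boldsymbol{\psi} - \mathbf{r}^T \mathbf{D} \mathbf{M}) \end{pmatrix}$,
9: end for
10: for $i = 1 : \ell_j$ do
11: Run a local version of Algorithm 2 with $(x_{ij}, y_{ij}, w_{ij})$.
12: end for
13: $\mathbf{z} \leftarrow \mathbf{H} \breve{\mathbf{y}}$
14: $\mathbf{b} \leftarrow$ Solution to $\left( \mathbf{M}^T (\mathbf{D} - \mathbf{H}) \mathbf{M} \right) \mathbf{b} = \mathbf{M}^T \mathbf{z}$,
15: $\breve{\mathbf{a}} \leftarrow$ Solution to $(\mathbf{D} \mathbf{L}^T) \breve{\mathbf{a}} = \mathbf{z} + (\mathbf{H} - \mathbf{D}) \mathbf{M} \mathbf{b}$.
16: $\mathbf{a}^j \leftarrow \mathbf{R}^j \left[ \mathbf{y}^j - \alpha \mathbf{L}(\mathbf{h}^j, :) (\mathbf{z} + \mathbf{H} \mathbf{M} \mathbf{b}) \right]$.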



input space $X$ is therefore a set of 489 artists, i.e.

$$X = \{ \text{Bob Marley}, \text{Madonna}, \text{Michael Jackson}, \ldots \}.$$

The tasks are associated with user preference functions. More precisely, normalized
preferences of user $j$ over the entire set of artists are expressed by functions
$s_j : X \to [0, 1]$ defined as

$$s_j(x_i) = \frac{1}{1 + e^{-f_j(x_i)/2}},$$

where $f_j : X \to \mathbb{R}$ are the tasks to be learnt.

The simulated music recommendation system relies on music type classification
expressed by means of tags (rock, pop, ...). In particular, the 19 main
tags of LastFM were considered. The $i$-th artist is associated with a vector
$\mathbf{z}_i \in [0, 1]^{19}$ of 19 tags, whose values were obtained by querying the LastFM
site^2 on September 22, 2008. In Figure 2, the list of the tags considered in this
experiment, together with an example of artist's tagging, are reported. Vectors
$\mathbf{z}_i$ have been normalized to lie on the unit hyper-sphere, i.e. $\|\mathbf{z}_i\|_2 = 1$. The
input space data (artists together with their normalized tags) are available for
download^3.

Tag information was used to build a mixed-effect kernel over $X$. More precisely,
$\overline{K}$ is a Gaussian RBF kernel and $\widetilde{K}^j = \widetilde{K}$ are linear kernels:

$$\overline{K}(x_i(\mathbf{z}_i), x_j(\mathbf{z}_j)) = e^{1 - \|\mathbf{z}_i - \mathbf{z}_j\|^2 / 2} = e^{\mathbf{z}_i^T \mathbf{z}_j}, \qquad \widetilde{K}(x_i(\mathbf{z}_i), x_j(\mathbf{z}_j)) = \mathbf{z}_i^T \mathbf{z}_j.$$

^2 http://www.lastfm.com

^3 http://www-dimat.unipv.it/~dinuzzo/files/mrdata.zip

The above kernels were employed to generate synthetic users. First, an
"average user" was generated by drawing a function $\bar{f} : X \to \mathbb{R}$ from a Gaussian
process with zero mean and auto-covariance $\overline{K}$. Then, $m = 3000$ virtual users'
preferences were generated as

$$f_j = 0.25 \bar{f} + 0.75 \tilde{f}_j,$$

where $\tilde{f}_j$ are drawn from a Gaussian process with zero mean and auto-covariance
$\widetilde{K}$. For the $j$-th virtual user, $\ell_j = 5$ artists $x_{ij}$ were uniformly randomly sampled
from the input space $X$, and corresponding noisy outputs $y_{ij}$ generated as

$$y_{ij} = f_j(x_{ij}) + \epsilon_{ij},$$

where $\epsilon_{ij}$ are i.i.d. Gaussian errors with zero mean and standard deviation
$\sigma = 0.01$. The learned preference function $\hat{s}_j$ is

$$\hat{s}_j(x_i) = \frac{1}{1 + e^{-\hat{f}_j(x_i)/2}},$$

where fj is estimated using the algorithm described in the paper. Performances 
are evaluated by both the average root mean squared error 



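$$\mathrm{RMSE} = \sqrt{ \frac{1}{m |X|} \sum_{i=1}^{|X|} \sum_{j=1}^{m} \left( s_j(x_i) - \hat{s}_j(x_i) \right)^2 }$$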

and the average number of hits over the top 20 ranked artists, defined as 



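$$\mathrm{TOP20HITS} = \frac{1}{m} \sum_{j=1}^{m} \mathrm{hits20}_j, \qquad \mathrm{hits20}_j := \left| \mathrm{top20}(s_j) \cap \mathrm{top20}(\hat{s}_j) \right|,$$

where $\mathrm{top20} : \mathcal{H} \to X^{20}$ returns the sorted vector of 20 inputs with highest
scores, measured by a function $s : X \to [0, 1]$, $s \in \mathcal{H}$.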
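The data-generation protocol and the two scores can be sketched as follows. This is an illustration under stated assumptions, not the code used for the experiment: the tag vectors are random placeholders, the problem sizes are reduced, artists are sampled without replacement within each user, the RMSE is taken over all users and artists jointly, and the "estimate" is a placeholder that predicts every user by the average-user component instead of running the algorithm of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, ell, sigma = 60, 200, 5, 0.01  # reduced sizes; the text uses |X| = 489, m = 3000

# Placeholder tag vectors on the unit sphere (NOT the LastFM data).
Z = rng.random((n, 19))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
K_bar = np.exp(Z @ Z.T)  # kernel of the "average user"
K_tilde = Z @ Z.T        # linear kernel of the individual effects

def sample_gp(K, size):
    """Draw `size` zero-mean Gaussian vectors with covariance K (one per row)."""
    w, V = np.linalg.eigh(K)
    A = V * np.sqrt(np.clip(w, 0.0, None))  # A @ A.T equals K up to round-off
    return rng.standard_normal((size, len(K))) @ A.T

def preference(f):
    """Logistic map from task values to normalized preferences in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-f / 2.0))

f_bar = sample_gp(K_bar, 1)[0]                   # average user
F = 0.25 * f_bar + 0.75 * sample_gp(K_tilde, m)  # row j holds f_j on all artists

# Five training artists per user (sampling without replacement is an assumption).
idx = np.stack([rng.choice(n, size=ell, replace=False) for _ in range(m)])
Y = np.take_along_axis(F, idx, axis=1) + sigma * rng.standard_normal((m, ell))

def rmse(S_true, S_hat):
    """Root mean squared error over all users and artists."""
    return float(np.sqrt(np.mean((S_true - S_hat) ** 2)))

def hits20(s_true, s_hat, k=20):
    """Number of artists shared by the true and estimated top-k lists."""
    return len(np.intersect1d(np.argsort(-s_true)[:k], np.argsort(-s_hat)[:k]))

# Placeholder estimate: every user is predicted by the average-user component only.
S_true = preference(F)
S_hat = preference(np.tile(0.25 * f_bar, (m, 1)))
print("RMSE:", rmse(S_true, S_hat))
print("TOP20HITS:", np.mean([hits20(S_true[j], S_hat[j]) for j in range(m)]))
```

Replacing `S_hat` with the output of an actual multi-task estimator trained on `idx` and `Y` would give scores comparable in spirit to those reported in the text.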

Learning was performed for 15 values of the shrinking parameter $\alpha$ linearly
spaced in $[0, 1]$ and 15 values of the regularization parameter $\lambda$ logarithmically
spaced in the interval $[10^{-7}, 10^{0}]$, see Figure 3. The multi-task approach,
i.e. $0 < \alpha < 1$, outperforms both the separate ($\alpha = 0$) and pooled ($\alpha = 1$)
approaches. Interestingly, performance remains fairly stable over a range of values
of $\alpha$. Figure 4 shows the distribution of $\mathrm{hits20}_j$ over the 3000 users in
correspondence with the values $\alpha^*$ and $\lambda^*$ achieving the optimal RMSE. Although
$\alpha^* = 0.0714$ and $\lambda^* = 3.1623 \cdot 10^{[\ldots]}$ were selected so as to minimize
the RMSE, remarkably good performance is also obtained with respect to
the TOP20HITS score, which is 8.3583 (meaning that on average 8.3583
artists among the top 20 are correctly retrieved). Finally, true and estimated
top-20 hits are reported for the average user (Figure 5) and two representative
users (Figure 6). Artists of the true top 20 that are correctly retrieved in the
estimated top 20 are reported in bold-face.
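The search grid can be sketched in a few lines. The lower bound of the $\lambda$ interval is read here as $10^{-7}$, which is an assumption about the garbled source; with that choice the grid moves in half-decade steps, so every other point has mantissa $\sqrt{10} \approx 3.1623$, and the first interior point of the $\alpha$ grid is $1/14 \approx 0.0714$.

```python
import numpy as np

alphas = np.linspace(0.0, 1.0, 15)  # shrinking parameter grid
lambdas = np.logspace(-7, 0, 15)    # regularization grid (lower bound assumed to be 1e-7)

alpha_first = round(float(alphas[1]), 4)  # first interior grid point, 1/14
decade_steps = np.diff(np.log10(lambdas))  # spacing of the lambda grid in decades

print(alpha_first)          # 0.0714
print(decade_steps[0])      # half a decade per step
print(len(alphas) * len(lambdas), "parameter pairs to evaluate")
```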

Concerning the computational burden, it is worth observing that, without
exploiting the presence of repeated inputs and the mixed-effect structure of the
kernel, the complexity of a naive approach would be of the order of the cube of
the overall number of examples, that is $(5 \cdot 3000)^3$. Conversely, the complexity
of the approach proposed in the paper scales with $n^3 m$, where $n$ is the number
of unique inputs and $m$ the number of tasks (in our example, $n$ is bounded by
the cardinality of the input set $|X| = 489$, and $m = 3000$).

Figure 2: Example of artist tagging

6 Conclusions 

Recent studies have highlighted the potential of kernel methods applied to
multi-task learning, but their effective implementation involves the solution of
architectural and complexity issues. In this paper, emphasis is placed on the
architecture with reference to learning from distributed datasets. For a general
class of kernels with a "mixed-effect" structure, it is shown that the optimal
solution can be computed through a collaborative client-server architecture that
enjoys favorable computational and confidentiality properties. By interacting with the
server, each client can solve its own estimation task while taking advantage of
all the data from the other clients without having any direct access to them.
Clients' privacy is preserved, since both active and passive clients are allowed
by the architecture. The former are those that agree to send their data to
the server, while the latter only exploit information from the server without
disclosing their private data. The proposed architecture has several potential
applications ranging from biomedical data analysis (where privacy issues are
crucial) to web data mining. An illustrative example is given by the simulated
music recommendation system discussed in the paper.

Figure 4: Distribution of $\mathrm{hits20}_j$ in correspondence with $\alpha^*$ and $\lambda^*$ achieving
optimal RMSE.

Figure 5: True and estimated Top20 for the "average user".

Figure 6: True and estimated Top20 for two representative users.
Recall the following two lemmas on matrix inversions, see e.g. [31].

Lemma 1 (Sherman-Morrison-Woodbury) Let $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{m \times m}$
be two nonsingular matrices, $U \in \mathbb{R}^{n \times m}$, $V \in \mathbb{R}^{m \times n}$ such that $(A + UBV)$ is
nonsingular. Then, matrix

$$E := (B^{-1} + V A^{-1} U)$$

is nonsingular, and

$$(A + UBV)^{-1} = A^{-1} - A^{-1} U E^{-1} V A^{-1}.$$
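A quick numerical sanity check of Lemma 1 is sketched below. The matrices are random and chosen symmetric positive definite, with $V = U^T$, purely so that every inverse involved is guaranteed to exist; the lemma itself does not require these restrictions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 3

# Random symmetric positive definite A and B keep every inverse well defined.
GA = rng.standard_normal((n, n))
A = GA @ GA.T + n * np.eye(n)
GB = rng.standard_normal((m, m))
B = GB @ GB.T + m * np.eye(m)
U = rng.standard_normal((n, m))
V = U.T  # with V = U^T, A + U B V is positive definite, hence nonsingular

A_inv = np.linalg.inv(A)
E = np.linalg.inv(B) + V @ A_inv @ U

lhs = np.linalg.inv(A + U @ B @ V)                       # direct n x n inverse
rhs = A_inv - A_inv @ U @ np.linalg.inv(E) @ V @ A_inv   # only an m x m inverse of E
smw_gap = np.abs(lhs - rhs).max()
print(smw_gap)
```

The practical appeal is that, once $A^{-1}$ is available, the update only needs the inverse of the smaller $m \times m$ matrix $E$.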



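Lemma 2 (Schur) Suppose that matrix

$$\begin{pmatrix} B & C \\ C^T & D \end{pmatrix}$$

is nonsingular, with $B \in \mathbb{R}^{n \times n}$. Then,

$$E := (C^T B^{-1} C - D)$$

is nonsingular and

$$\begin{pmatrix} B & C \\ C^T & D \end{pmatrix}^{-1} = \begin{pmatrix} B^{-1} - B^{-1} C E^{-1} C^T B^{-1} & B^{-1} C E^{-1} \\ E^{-1} C^T B^{-1} & -E^{-1} \end{pmatrix}.$$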
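The block-inverse identity behind Lemma 2 can be checked numerically as follows. The sketch takes the partitioned matrix to be $\begin{pmatrix} B & C \\ C^T & D \end{pmatrix}$, which is an assumption consistent with the definition of $E$, and uses a random symmetric positive definite matrix so that both the full matrix and its block $B$ are invertible.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 3

# A random symmetric positive definite matrix, partitioned into blocks.
G = rng.standard_normal((n + m, n + m))
A = G @ G.T + (n + m) * np.eye(n + m)
B, C, D = A[:n, :n], A[:n, n:], A[n:, n:]

B_inv = np.linalg.inv(B)
E = C.T @ B_inv @ C - D  # negative of the usual Schur complement of B
E_inv = np.linalg.inv(E)

block_inv = np.block([
    [B_inv - B_inv @ C @ E_inv @ C.T @ B_inv, B_inv @ C @ E_inv],
    [E_inv @ C.T @ B_inv, -E_inv],
])
schur_gap = np.abs(block_inv - np.linalg.inv(A)).max()
print(schur_gap)
```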




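Proof of Theorem 1. Let $R$ be defined as in lines 1-5 of Algorithm 1 and
observe that

$$R^{-1} = (1 - \alpha) \sum_{j=1}^{m} I(:, k^j) \tilde{K}^j(k^j, k^j) I(k^j, :) + \lambda W.$$

Consider the back-fitting formulation (5), (6) of the linear system. By
Lemma 1 we have:

$$(K + \lambda W)^{-1} = \left( \alpha \bar{K} + R^{-1} \right)^{-1} = \left( \alpha P L D L^T P^T + R^{-1} \right)^{-1} = R - \alpha R P L H L^T P^T R.$$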
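For completeness, the way Lemma 1 produces the last equality can be spelled out. The choice of factors below, and the resulting expression for $H$, are inferred by matching terms and are an assumption about how $H$ is defined in Algorithm 1, not a quotation from it. The step requires $\alpha \neq 0$ and $D$ nonsingular.

```latex
% Apply Lemma 1 with
%   A = R^{-1},  U = P L,  B = \alpha D,  V = L^T P^T.
\begin{aligned}
E &= (\alpha D)^{-1} + L^T P^T R P L, \\
\left( R^{-1} + \alpha P L D L^T P^T \right)^{-1}
  &= R - R P L E^{-1} L^T P^T R \\
  &= R - \alpha R P L H L^T P^T R,
\qquad
H := (\alpha E)^{-1} = \left( D^{-1} + \alpha L^T P^T R P L \right)^{-1}.
\end{aligned}
```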
Let $F := \alpha L^T P^T R P L$, and observe that

$$D F H = H F D = D - H.$$

In the following, we exploit the relationship

$$\alpha L^T P^T (K + \lambda W)^{-1} P L = \alpha L^T P^T \left( R - \alpha R P L H L^T P^T R \right) P L = F - F H F.$$
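The three identities used so far can be verified numerically on a small random instance. In this sketch $P$ is a hypothetical selection matrix mapping unique inputs to examples, $L D L^T$ is obtained from an eigendecomposition of a random positive definite matrix, $R$ is an arbitrary symmetric positive definite matrix, and $H$ is taken to be $(D^{-1} + F)^{-1}$, which is an assumption about its definition in Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_u, N, alpha = 4, 10, 0.3  # unique inputs, total examples, shrinking parameter

# Hypothetical ingredients (illustrative, not taken from the paper's data).
P = np.eye(n_u)[rng.integers(0, n_u, size=N)]  # N x n_u, a single 1 per row
G = rng.standard_normal((n_u, n_u))
w, L = np.linalg.eigh(G @ G.T + np.eye(n_u))   # L diag(w) L^T, all w > 0
D = np.diag(w)
GR = rng.standard_normal((N, N))
R_inv = GR @ GR.T + N * np.eye(N)              # plays the role of R^{-1}
R = np.linalg.inv(R_inv)

PL = P @ L
F = alpha * PL.T @ R @ PL
H = np.linalg.inv(np.linalg.inv(D) + F)        # assumed definition of H

# (alpha P L D L^T P^T + R^{-1})^{-1} = R - alpha R P L H L^T P^T R
full_inv = np.linalg.inv(alpha * PL @ D @ PL.T + R_inv)
gap_inverse = np.abs(full_inv - (R - alpha * R @ PL @ H @ PL.T @ R)).max()

# D F H = H F D = D - H
gap_dfh = max(np.abs(D @ F @ H - (D - H)).max(),
              np.abs(H @ F @ D - (D - H)).max())

# alpha L^T P^T (K + lambda W)^{-1} P L = F - F H F
gap_fhf = np.abs(alpha * PL.T @ full_inv @ PL - (F - F @ H @ F)).max()
print(gap_inverse, gap_dfh, gap_fhf)
```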

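Consider the case $\alpha \neq 0$. Then, in view of the previous relationship, recalling
that $\Psi = P \bar{\Psi} = P L D M$, we have

$$\begin{aligned}
\alpha \Psi^T (K + \lambda W)^{-1} \Psi &= \alpha M^T D L^T P^T (K + \lambda W)^{-1} P L D M \\
&= M^T D (F - F H F) D M \\
&= M^T (D - D F H) F D M \\
&= M^T H F D M \\
&= M^T (D - H) M,
\end{aligned}$$

and

$$\begin{aligned}
\Psi^T (K + \lambda W)^{-1} y &= M^T D L^T P^T \left( R - \alpha R P L H L^T P^T R \right) y \\
&= M^T (D - D F H) L^T P^T R y \\
&= M^T H \bar{y},
\end{aligned}$$

where $\bar{y} = L^T P^T R y$.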

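Then, line 10 of Algorithm 1 follows from (5). Observe that

$$(K + \lambda W) a = \alpha \bar{K} a + R^{-1} a.$$

Then, from (6) we have

$$a = R \left[ y - \alpha P L \left( D L^T P^T a + D M b \right) \right].$$

Now,

$$\begin{aligned}
D L^T P^T a + D M b &= D L^T P^T (K + \lambda W)^{-1} (y - \alpha P L D M b) + D M b \\
&= (D - D F H) (L^T P^T R y - F D M b) + D M b \\
&= H L^T P^T R y - H F D M b + D M b \\
&= H (\bar{y} + M b).
\end{aligned}$$

Hence, we obtain line 11 of Algorithm 1. Finally, for $\alpha = 0$, we have $H = D$, so
that the thesis follows.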
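Proof of Theorem 2. Let $F := \alpha L^T P^T R P L$. Recalling the expressions of $a$, $\bar{y}$,
$H$ in Algorithm 1 we have

$$\begin{aligned}
D L^T \bar{a} &= D L^T P^T a \\
&= D L^T P^T R \left[ y - \alpha P L H (\bar{y} + M b) \right] \\
&= D \left[ \bar{y} - F H (\bar{y} + M b) \right] \\
&= (D - D F H) \bar{y} - D F H M b \\
&= H \bar{y} - (D - H) M b \\
&= H (\bar{y} + M b) - D M b.
\end{aligned}$$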
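The chain of equalities above holds for any $M$ and $b$, which makes it easy to check on random data. The sketch below assumes $\bar{y} = L^T P^T R y$, $H = (D^{-1} + F)^{-1}$ and $a = R[y - \alpha P L H(\bar{y} + M b)]$, as suggested by the derivation; the matrices themselves are random placeholders with illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)
n_u, N, q, alpha = 4, 10, 2, 0.3

# Hypothetical ingredients (random placeholders, not the paper's data).
P = np.eye(n_u)[rng.integers(0, n_u, size=N)]  # selection matrix, N x n_u
G = rng.standard_normal((n_u, n_u))
w, L = np.linalg.eigh(G @ G.T + np.eye(n_u))
D = np.diag(w)
GR = rng.standard_normal((N, N))
R = np.linalg.inv(GR @ GR.T + N * np.eye(N))   # symmetric positive definite
M = rng.standard_normal((n_u, q))
b = rng.standard_normal(q)
y = rng.standard_normal(N)

PL = P @ L
F = alpha * PL.T @ R @ PL
H = np.linalg.inv(np.linalg.inv(D) + F)        # assumed definition of H
y_bar = PL.T @ R @ y                           # assumed definition of y_bar
a = R @ (y - alpha * PL @ H @ (y_bar + M @ b)) # coefficient vector from the derivation

lhs = D @ PL.T @ a                    # D L^T P^T a
rhs = H @ (y_bar + M @ b) - D @ M @ b # H (y_bar + M b) - D M b
gap = np.abs(lhs - rhs).max()
print(gap)
```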